Why was this cited? Explainable Machine Learning Applied to COVID-19 Research Articles

This notebook contains the code for all models used to produce the results in the article "Why was this cited? Explainable Machine Learning Applied to COVID-19 Research Articles".

Expected directory structure

Inputs of this notebook - description

DATASET VERSION 1

1) metadata_with_opencitations.csv

2) biblio.json

3) author_names_info.csv

DATASET VERSION 2

1) df_sw_tok_low_punc_lemm_v5.csv

2) citationcounts_oci_revised.csv

RESULTS FOR DATASET VERSION 1

Libraries

Parameters

Source data

Documents are taken from CORD-19: https://allenai.org/data/cord-19

Plot citations

Bibliometric features

Create several versions of matrices

This operation can be performed on the complete dataset without risk of leakage because the vectorization is binary: each value records only whether a term occurs in that document.
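The leakage argument can be illustrated with a minimal sketch. It is not the notebook's own vectorization code: the helper names `binary_matrix` and `idf`, the whitespace tokenization, and the three toy documents are all assumptions made for illustration. The point it tries to show is that a binary row depends only on its own document, whereas a corpus-level statistic such as inverse document frequency changes when held-out documents are added.

```python
import math


def binary_matrix(docs, vocabulary):
    """Return one 0/1 row per document: 1 if the term occurs in that document."""
    rows = []
    for doc in docs:
        tokens = set(doc.lower().split())
        rows.append([1 if term in tokens else 0 for term in vocabulary])
    return rows


def idf(docs, term):
    """Plain inverse document frequency, shown only as a contrast."""
    df = sum(1 for doc in docs if term in doc.lower().split())
    return math.log(len(docs) / df)


# Toy corpus (hypothetical): the third document plays the role of test data.
docs = ["covid vaccine trial", "vaccine safety", "covid transmission model"]
vocabulary = sorted({tok for doc in docs for tok in doc.lower().split()})

full = binary_matrix(docs, vocabulary)            # vectorize everything at once
train_only = binary_matrix(docs[:2], vocabulary)  # vectorize without the test doc

# The training rows are identical either way, so nothing leaks through the values.
rows_unchanged = full[:2] == train_only

# A corpus statistic, by contrast, shifts when the test document is included.
idf_full = idf(docs, "covid")
idf_train = idf(docs[:2], "covid")
```

In this sketch the vocabulary is still built from all documents, so the set of columns is shared; only the cell values are independent of the other documents.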

TF-IDF for Random forest

BOW matrix

Binary document matrix with scispacy entities

This operation can be performed on the complete dataset without risk of leakage.

Add ConceptNet entities

PubTator

Note that "full data" was renamed to "full_data". PubTator processing uses all text nodes shorter than 20 characters.
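The length filter described above might look roughly like the following sketch. The input format is an assumption: the outline does not say how the PubTator output is stored, so an XML fragment with made-up tag names (`doc`, `annotation`, `passage`) is used here, and `short_text_nodes` is a hypothetical helper rather than a function from the notebook.

```python
import xml.etree.ElementTree as ET

MAX_LEN = 20  # threshold taken from the note above


def short_text_nodes(xml_string, max_len=MAX_LEN):
    """Collect non-empty text nodes that are shorter than max_len characters."""
    root = ET.fromstring(xml_string)
    nodes = []
    for text in root.itertext():
        text = text.strip()
        if text and len(text) < max_len:
            nodes.append(text)
    return nodes


# Hypothetical fragment: two short entity mentions and one long passage.
sample = (
    "<doc>"
    "<annotation>SARS-CoV-2</annotation>"
    "<passage>This passage is much longer than twenty characters.</passage>"
    "<annotation>ACE2</annotation>"
    "</doc>"
)

entities = short_text_nodes(sample)
```

Under these assumptions the long passage is dropped and only the short mentions remain as candidate entities.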

Plot citations

Author names matrix

1) Bibliometric features matrix for Random forest

2) Bibliometric features matrix for rule mining

Connected matrices

Reduce the number of rows (so that every matrix covers the same documents as the bibliometric features matrix)
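A minimal sketch of this row reduction, assuming each matrix can be addressed by a document identifier. The identifiers, matrix names, and the helpers `common_ids` and `align` are hypothetical; the notebook's actual key column and data structures are not shown in this outline.

```python
def common_ids(reference_ids, matrices):
    """Ids from the reference list that are present in every matrix, in reference order."""
    return [doc_id for doc_id in reference_ids
            if all(doc_id in rows for rows in matrices.values())]


def align(matrices, ids):
    """Rebuild each matrix so that row i belongs to ids[i] in all of them."""
    return {name: [rows[doc_id] for doc_id in ids]
            for name, rows in matrices.items()}


# Hypothetical data: the bibliometric features cover fewer documents.
bibliometric_ids = ["doc_a", "doc_c"]
matrices = {
    "tfidf": {"doc_a": [0.1, 0.0], "doc_b": [0.3, 0.2], "doc_c": [0.0, 0.5]},
    "bow": {"doc_a": [1, 0], "doc_b": [2, 1], "doc_c": [0, 3]},
}

ids = common_ids(bibliometric_ids, matrices)
aligned = align(matrices, ids)
```

Aligning on identifiers rather than on row position keeps the rows of the connected matrices pointing at the same document even when the sources were built in a different order.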

Optimal SIZE of matrices

BERT

BERT TOKENIZATION - FOR RANDOM FOREST

BERT tokenizer

BERT model (BERT embeddings)

Random forest with BERT - classifier

Random forest with BERT - regression

BERT tokenizer for neural network

Neural Network modelling

Preprocess data:

Random forest

Grid

Running to get FI for all matrices

Running to get results for reduced matrices

SHAP

LIME

RULE MINING MODELS

For dataset version 1.

CORELS

CBA

Author names analysis

Try to detect the number of unique author names

Conclusion: it is not possible to detect the number of unique names, because the same name is often written in several different ways.
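The difficulty can be illustrated with a small sketch. The names and the normalization rule (`name_key`, which reduces a name to an ASCII-folded surname plus first initial) are illustrative assumptions, not the notebook's method. Counting raw strings overestimates the number of authors, while a crude normalization can merge different people.

```python
import unicodedata


def name_key(name):
    """Crude key: ASCII-folded surname plus first initial (illustrative only)."""
    folded = unicodedata.normalize("NFKD", name)
    folded = folded.encode("ascii", "ignore").decode("ascii")
    folded = folded.replace(".", " ").strip()
    if "," in folded:
        # "Surname, Given" form
        surname, given = [part.strip() for part in folded.split(",", 1)]
    else:
        # "Given Surname" form
        parts = folded.split()
        surname, given = parts[-1], " ".join(parts[:-1])
    return (surname.lower(), given[:1].lower())


# Hypothetical spellings: the first three may be one person, the last is another.
names = ["Müller, Anna", "Anna Muller", "A. Muller", "Muller, Andreas"]

raw_unique = len(set(names))                       # every spelling counted separately
keyed_unique = len({name_key(n) for n in names})   # everything collapses to one key
```

Neither count matches the two people these toy names are meant to represent, which is consistent with the conclusion above.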

Plot 1

Statistical tests:

Plot 2